Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs (Theoretical Proofs)
In this document, we show the proofs for the theoretical results described in the paper titled "Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs" submitted to AAAI 2014. In this paper, we consider MDPs where a joint state can transition to any other joint state with non-zero probability, that is, the MDP is unichain. We show the decomposability of the value function of a unichain MDP, which leads to the property that the Distributed RVI Q-learning algorithm converges to an optimal solution.

To simplify the analysis, we assume that the set of global joint states $S$ and the set of global joint values $D$ are finite, and that the Markov chains for all the agents, induced by any policy, are aperiodic. A Markov chain is aperiodic when it converges to its stationary distribution in the limit (Puterman, 2005).

It is known that there always exists an optimal solution for a given unichain MDP, and that this solution can be characterized by the $V^*(s)$ values:

Theorem 1. (Puterman, 2005) There exists an optimal Q-value $Q^*(s, d)$ for each joint state $s$ and joint value $d$ in an average-reward unichain MDP with a bounded reward function, satisfying

$$Q^*(s, d) + \rho^* = F(s, d) + \sum_{s'} P(s' \mid s, d) \max_{d' \in D} Q^*(s', d') \qquad (1)$$

Additionally, there exists a unique V-value $V^*(s) = \max_{d \in D} Q^*(s, d)$ for each joint state $s$ such that

$$V^*(s) + \rho^* = \max_{d \in D} \Big[ F(s, d) + \sum_{s'} P(s' \mid s, d)\, V^*(s') \Big] \qquad (2)$$

with $V^*(s^0) = 0$ for the initial joint state $s^0$.

To help us prove the decomposability of the value function, we first show the decomposability of the average reward $\rho^*$ of a given optimal policy:

Lemma 1. For a given unichain MDP, the optimal average reward $\rho^* = \sum_{i=1}^{m} \rho_i^*$ can be decomposed into a sum of local average rewards $\rho_i^*$, one for each reward function $f_i \in F$.

Proof Sketch of Lemma 1: For a given unichain MDP, there always exists, in the limit, a stationary distribution $P_\pi(s)$ of the global joint state $s \in S$, where $\pi$ is the converged global joint policy. Hence, we have the existence of $\rho_i^* = \sum_{s \in S} P_\pi(s)\, f_i(s_i, d_i \mid s_i \in s,\, d_i = \pi(s))$ for each reward function $f_i \in F$.
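As a concrete illustration of the proof sketch of Lemma 1, the following is a minimal Python sketch (the induced chain, the two reward functions, and all variable names are illustrative assumptions, not taken from the paper): it computes the stationary distribution of a small unichain Markov chain induced by a fixed joint policy and checks that the global average reward equals the sum of the local average rewards.

```python
# Minimal numerical sketch of Lemma 1 (illustrative assumptions only).
import numpy as np

# Transition matrix P_pi[s, s'] of the Markov chain induced by a fixed joint
# policy pi. All entries are positive, so the chain is unichain and aperiodic.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])

# Local rewards f_i(s_i, pi(s)) of two hypothetical reward functions f1, f2,
# evaluated under the fixed policy in each of the three joint states.
f1 = np.array([1.0, 0.0, 2.0])
f2 = np.array([0.5, 1.5, 0.0])

# Stationary distribution P_pi(s): the left eigenvector of P_pi for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
stat = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
stat = stat / stat.sum()

rho_1 = stat @ f1           # local average reward rho*_1
rho_2 = stat @ f2           # local average reward rho*_2
rho = stat @ (f1 + f2)      # global average reward rho*

assert np.isclose(rho, rho_1 + rho_2)  # rho* = rho*_1 + rho*_2, as in Lemma 1
print(rho_1, rho_2, rho)
```

The decomposition holds by linearity of expectation under the stationary distribution, which is the observation the proof sketch relies on.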
From the decomposability of the average reward given by Lemma 1 and the characterization of the $V^*$ values given in Theorem 1, we now prove the decomposability of $V^*$ as follows:

Definition 1. $\bar{P}_i(s' \mid s, d_i)$ is the probability of transitioning to joint state $s'$ from joint state $s$ given local joint value $d_i$, with the other values following policy $\Phi$, and with $V_j^*(s_j^0) = 0$ for each reward function $f_j \in F$.

Theorem 2. There exist $V_i^*(s) = Q_i^*(s, d_i \mid d_i \in \arg\max_{d \in D} Q^*(s, d))$ and $\rho_i^*$ for each reward function $f_i \in F$ under an optimal policy $\Phi(s) = d^* \in \arg\max_{d \in D} Q^*(s, d)$ such that

$$V_i^*(s) + \rho_i^* = f_i\big(s_i, d_i \mid s_i \in s,\, d_i \in \arg\max_{d \in D} Q^*(s, d)\big) + \sum_{s'} \bar{P}_i\big(s' \mid s, d_i \mid d_i \in \arg\max_{d \in D} Q^*(s, d)\big)\, V_i^*(s') \qquad (3)$$

and $V^*(s) = \sum_i V_i^*(s)$.

Proof Sketch of Theorem 2: We do not show how to decompose $Q^*(s, d)$ into $Q_i^*(s, d_i)$, but only show that such a decomposition exists. The proof is based on the uniqueness of an optimal solution for any unichain MDP, which is given by Theorem 1.

Step 1: We first propose a modified MD-DCOP and decompose it into a set of subproblems, where each subproblem has a corresponding reward function $f_i$.
Step 2: Suppose we know the optimal policy of the original problem, which always exists due to Theorem 1. Then, for each subproblem in the modified problem, if we fix the other variables (those not in the subproblem) according to the optimal policy of the original problem, we can compute the decomposed optimal Q-values $Q_i^*$. Additionally, Theorem 1 guarantees the existence of these decomposed optimal Q-values.
Step 3: Next, we show that the global optimal Q-values (the sum of the decomposed optimal Q-values) of the modified MD-DCOP are the same as the global optimal Q-values of the original MD-DCOP.
Step 4: Finally, we show how to decompose the global optimal Q-values, which concludes the proof.

Step 1: Consider a modified MD-DCOP where the transition probabilities are the same as in the original MD-DCOP, but the reward function for each joint state $s$ and local joint value $d_i$ is defined as follows:

$$\bar{f}_i(s, d_i) = \begin{cases} f_i(s_i, d_i \mid s_i \in s) & \text{if } d_i \in \arg\max_{d \in D} Q^*(s, d) \\ -C & \text{otherwise} \end{cases} \qquad (4)$$

where $C$ is a very large constant.

Step 2: We now show the existence of the decomposed Q-values $\bar{Q}_i^*(s, d_i)$ for each reward function $f_i$. First, set the policy of every other variable that is not in the subproblem defined by reward function $f_i$ to its respective optimal policy in the original MD-DCOP. Also set the transition probabilities $\bar{P}_i(s' \mid s, d_i)$ according to the premise of Theorem 2, and set the reward functions $\bar{f}_i(s, d_i)$ according to Equation 4. According to Theorem 1, there exists a decomposed Q-value $\bar{Q}_i^*$ for this subproblem such that

$$\bar{Q}_i^*(s, d_i) + \rho_i^* = \bar{f}_i(s, d_i) + \sum_{s'} \bar{P}_i(s' \mid s, d_i)\, \bar{Q}_i^*\big(s', d_i' \mid d_i' \in \arg\max_{d' \in D} Q^*(s', d')\big) \qquad (5)$$

where $\rho_i^*$ corresponds to the local average reward of the subproblem, as shown in Lemma 1.

Step 3: Then, for the globally optimal joint value $d^* \in \arg\max_{d \in D} Q^*(s, d)$, let $d_i^*$ denote the local joint value in $d^*$, $\bar{Q}^*(s, d^*) = \sum_i \bar{Q}_i^*(s, d_i^*)$, and $\bar{F}(s, d^*) = \sum_i \bar{f}_i(s, d_i^*)$. Since all variables follow the optimal policy under $d^*$ and $\Phi$, we have $\bar{P}_i(s' \mid s, d_i^*) = P(s' \mid s, d^*)$ for every $i$; summing Equation 5 over all subproblems, we get

$$\bar{Q}^*(s, d^*) + \rho^* = \bar{F}(s, d^*) + \sum_{s'} P(s' \mid s, d^*)\, \bar{Q}^*\big(s', d'^* \mid d'^* \in \arg\max_{d' \in D} Q^*(s', d')\big)$$
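The decomposability results above underpin the convergence of the paper's Distributed RVI Q-learning algorithm, whose details are given in the main paper rather than here. For reference only, the following hedged Python sketch shows a standard centralized, tabular RVI Q-learning update driven by the average-reward optimality equation in Equation 1; it is not the paper's distributed algorithm, and the transition model, rewards, reference state, step size, and exploration rate are all illustrative assumptions.

```python
# Sketch of centralized, tabular RVI Q-learning for the fixed point of Eq. (1):
# Q*(s,d) + rho* = F(s,d) + sum_s' P(s'|s,d) max_d' Q*(s',d').
# Illustrative assumptions only; not the paper's Distributed RVI Q-learning.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Hypothetical joint model: P[s, d] is a distribution over next joint states,
# F[s, d] is a bounded global reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
F = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
ref_state = 0    # reference state; max_d Q(ref_state, d) estimates rho*
alpha = 0.05     # constant step size (a decaying schedule is used in theory)
epsilon = 0.1    # epsilon-greedy exploration rate

s = 0
for _ in range(200_000):
    d = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = rng.choice(n_states, p=P[s, d])
    rho_est = np.max(Q[ref_state])
    # RVI update: move Q(s,d) toward F(s,d) - rho_est + max_d' Q(s_next, d')
    Q[s, d] += alpha * (F[s, d] - rho_est + np.max(Q[s_next]) - Q[s, d])
    s = s_next

print("estimated optimal average reward rho*:", np.max(Q[ref_state]))
```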